In [1]:
import pandas as pd
import re

First, we'll import the data.


In [5]:
df = pd.read_csv('allPostText_test.csv')

In [9]:
df['Text'].head()


Out[9]:
0    \n« back to home\n  You can wear the flashiest...
1    \n« back to home\n  Exit polls show that Sebas...
2    \n« back to home\n  Is it going to be a five-y...
3    \n« back to home\n  The disgraceful thing is t...
4    \n« back to home\n  “I don’t recall any other ...
Name: Text, dtype: object

Let's clean it a bit.


In [12]:
def clean(elem):
    # Strip the navigation text the scraper captured around each post.
    elem = elem.replace('\n« back to home\n  ', '')
    elem = elem.replace('« previous postnext post »', '').strip()
    return elem

In [14]:
df['Text'].apply(clean).head(10)


Out[14]:
0    You can wear the flashiest watch and keep your...
1    Exit polls show that Sebastian Kurz, 31, is ab...
2    Is it going to be a five-year electoral campai...
3    The disgraceful thing is that this man has bee...
4    “I don’t recall any other budget having given ...
5                                                     
6                                       I mean, really
7    Toni Bezzina, the member of parliament, entere...
8                                                     
9    David Agius today came forward officially as a...
Name: Text, dtype: object

In [17]:
df['Text'] = df['Text'].apply(clean)

In [19]:
df = df[df['Text'] != '']
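
The cleaning and filtering steps above can also be written with pandas' vectorized string methods instead of apply. This is only a sketch: the small `sample` frame is made up for illustration and stands in for the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the scraped posts; the real data comes from the CSV.
sample = pd.DataFrame({'Text': [
    '\n« back to home\n  First post body« previous postnext post »',
    '\n« back to home\n  ',
    '\n« back to home\n  Second post body',
]})

# regex=False treats the search strings as literal text, like str.replace.
cleaned = (
    sample['Text']
    .str.replace('\n« back to home\n  ', '', regex=False)
    .str.replace('« previous postnext post »', '', regex=False)
    .str.strip()
)

# Drop the posts that are empty once the navigation text is gone.
non_empty_posts = cleaned[cleaned != '']
print(non_empty_posts.tolist())
```

The result should match what clean produces row by row; which form to use is mostly a matter of taste.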

Let's find some text and numbers.

First, a simple search.


In [23]:
text = df['Text'][10]

In [38]:
text


Out[38]:
'Yesterday in the car I was listening to the lunchtime talk-show on the Nationalist Party’s radio station, hosted by Evelyn Vella Brincat, whose brother is the failed party leadership contender Frank Portelli. It was unbearable, but I felt I needed to suffer through it in the interest of journalism. David Agius, the party whip and contender for the post of deputy leader, was on with her.I finally switched off when Mrs Vella Brincat announced that it was David Agius’s birthday – how old is he, 10? – that those listening to the show should give him “the best birthday present ever by becoming members of the Nationalist Party, because I have known David for a long time and he has always been a party boy so he will want that more than anything” (translated from the Maltese).Then the intellectually challenged and free-loading Mr Agius, who should have been at his state-paid job at the Freeport at that time of day, interjected and said that what he wants more than anything is: “Li nara l-Partit Nazzjonalista jinżel fil-grawnd, jilgħab il-logħba mal-Partit Laburista u jirbaħ.” (“To see the Nationalist Party walk onto the football pitch, play a game against the Labour Party, and win.”)This is what it has come to: the triumph of evil and idiocy, a kakocracy  on one side of the House and an idiocracy  on the other.Nationalist Party deputy leadership contender David Agius (left) with party leader Adrian Delia: the serious business of the running of the country treated like a game of football in which the sole aim is for the Nationalist Party to ‘win the game’ against Labour.'

In [30]:
re.search(r'car', text)


Out[30]:
<_sre.SRE_Match object; span=(17, 20), match='car'>

In [36]:
re.search(r'car', text).group()


Out[36]:
'car'

In [37]:
re.findall(r'car', text)


Out[37]:
['car']
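
To recap the three calls above: re.search returns a match object for the first hit (or None), .group() pulls the matched string out of it, and re.findall returns every hit as a plain string. The sketch below uses a made-up sentence rather than the blog post, and adds re.finditer, which is handy when the positions matter.

```python
import re

# Hypothetical short sentence standing in for the blog post above.
sample = 'Yesterday in the car I saw another car.'

# search: first match only, returned as a match object.
match = re.search(r'car', sample)
print(match.span(), match.group())

# findall: every match, as plain strings.
all_hits = re.findall(r'car', sample)
print(all_hits)

# finditer: one match object per hit, so spans are available too.
spans = [m.span() for m in re.finditer(r'car', sample)]
print(spans)

# search returns None when nothing matches, so check before calling .group().
missing = re.search(r'bus', sample)
print(missing.group() if missing else 'no match')
```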

In [42]:
# Now let's match patterns

In [35]:
re.search(r'[0-9]', text).group()


Out[35]:
'1'

In [39]:
re.findall(r'[0-9]', text)


Out[39]:
['1', '0']

In [41]:
re.findall(r'[0-9]+', text)


Out[41]:
['10']
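
The difference between the last two results comes from the quantifier: [0-9] matches exactly one digit, while [0-9]+ matches a run of one or more digits. A small sketch on a made-up sentence, with {4} added as one more quantifier to try:

```python
import re

# Hypothetical sentence with a few numbers in it.
sample = 'He turned 10 in 2017, not 9.'

single_digits = re.findall(r'[0-9]', sample)     # one digit per match
digit_runs = re.findall(r'[0-9]+', sample)       # one or more digits per match
four_digits = re.findall(r'[0-9]{4}', sample)    # exactly four digits in a row

print(single_digits)
print(digit_runs)
print(four_digits)
```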

Let's go to https://regexr.com to practice. Try to match the names in the text above.

Regexing


In [45]:
re.findall(r'[A-Z]\w+\s[A-Z]\w+', text)


Out[45]:
['Nationalist Party',
 'Evelyn Vella',
 'Frank Portelli',
 'David Agius',
 'Mrs Vella',
 'David Agius',
 'Nationalist Party',
 'Mr Agius',
 'Partit Nazzjonalista',
 'Partit Laburista',
 'Nationalist Party',
 'Labour Party',
 'Nationalist Party',
 'David Agius',
 'Adrian Delia',
 'Nationalist Party']
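
Reading the pattern piece by piece: [A-Z] is one capital letter, \w+ is one or more word characters, and \s is a single whitespace character, so the whole thing matches exactly two capitalized words in a row. That is why 'Evelyn Vella Brincat' shows up as 'Evelyn Vella'. One possible variation, shown here only as a sketch on a made-up sentence, repeats the second word with a non-capturing group so longer names stay in one piece.

```python
import re

# Hypothetical sentence built from names that appear in the post above.
sample = 'Evelyn Vella Brincat spoke to David Agius about the Nationalist Party.'

# The pattern used in this notebook: exactly two capitalized words.
two_words = re.findall(r'[A-Z]\w+\s[A-Z]\w+', sample)

# Variation: a capitalized word followed by one or more further capitalized words.
# (?: ... ) groups without capturing, so findall still returns the whole match.
two_or_more = re.findall(r'[A-Z]\w+(?:\s[A-Z]\w+)+', sample)

print(two_words)
print(two_or_more)
```

Neither version knows what a name is. Both will still pick up any run of capitalized words, such as a sentence-initial word followed by a name.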

Let's make a function that wraps the regex.


In [46]:
def regexing(elem):
    # Find pairs of adjacent capitalized words, e.g. 'David Agius'.
    lst = re.findall(r'[A-Z]\w+\s[A-Z]\w+', elem)
    return lst

In [50]:
df['Names'] = df['Text'].apply(regexing)


/Users/barneyjs/.virtualenvs/master/lib/python3.5/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

In [53]:
list(df['Names'])[:5]


Out[53]:
[[],
 ['Sebastian Kurz',
  'Freedom Party',
  'Social Democrats',
  'Sebastian Kurz',
  'Christian Kern',
  'Social Democrats',
  'Christian Strache',
  'Sebastian Kurz'],
 ['Labour Party',
  'Prime Minister',
  'Nationalist Party',
  'Rank Xerox',
  'Kristy Debono',
  'Hermann Schiavone',
  'Nationalist Party'],
 ['Nationalist Party', 'The Nationalist', 'European Union'],
 []]
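
A note on the SettingWithCopyWarning printed when the Names column was assigned: df is the result of filtering another DataFrame, and pandas cannot tell whether the new column is meant for the filtered rows only. The column is still created here, as the output above shows. One common way to avoid the warning is to take an explicit copy when filtering. A minimal sketch on a made-up frame (the names `posts`, `non_empty` and `Length` are only for illustration):

```python
import pandas as pd

# Hypothetical miniature of the posts table.
posts = pd.DataFrame({'Text': ['David Agius spoke.', '', 'I mean, really']})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column to it is unambiguous.
non_empty = posts[posts['Text'] != ''].copy()
non_empty['Length'] = non_empty['Text'].apply(len)

print(non_empty)
```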

In [55]:
lst = list(df['Names'])

Now we want to see who is mentioned the most.


In [58]:
# First we need to delete the empty lists:
lst = [x for x in lst if x != []]

In [59]:
lst


Out[59]:
[['Sebastian Kurz',
  'Freedom Party',
  'Social Democrats',
  'Sebastian Kurz',
  'Christian Kern',
  'Social Democrats',
  'Christian Strache',
  'Sebastian Kurz'],
 ['Labour Party',
  'Prime Minister',
  'Nationalist Party',
  'Rank Xerox',
  'Kristy Debono',
  'Hermann Schiavone',
  'Nationalist Party'],
 ['Nationalist Party', 'The Nationalist', 'European Union'],
 ['Toni Bezzina',
  'Nationalist Party',
  'MP Robert',
  'Edwin Vassallo',
  'David Agius',
  'Mr Agius',
  'Toni Bezzina',
  'Robert Arrigo',
  'Nationalist Party'],
 ['David Agius',
  'Nationalist Party',
  'Edwin Vassallo',
  'Nationalist Party',
  'Chris Said',
  'Adrian Delia',
  'Dr Said',
  'Dr Said',
  'Mr Vassallo',
  'Dr Said',
  'David Agius',
  'Dr Said',
  'Mr Agius',
  'Dr Delia',
  'Dr Said',
  'Dr Delia',
  'Dr Said',
  'Mr Agius',
  'Dr Delia',
  'Nationalist Party',
  'Robert Arrigo',
  'Clyde Puli',
  'When David',
  'Chris Said',
  'Mr Puli',
  'Dr Delia',
  'Edwin Vassallo',
  'Dr Delia',
  'Mr Agius',
  'Robert Arrigo',
  'Dr Delia'],
 ['Nationalist Party',
  'Evelyn Vella',
  'Frank Portelli',
  'David Agius',
  'Mrs Vella',
  'David Agius',
  'Nationalist Party',
  'Mr Agius',
  'Partit Nazzjonalista',
  'Partit Laburista',
  'Nationalist Party',
  'Labour Party',
  'Nationalist Party',
  'David Agius',
  'Adrian Delia',
  'Nationalist Party'],
 ['The Malta',
  'Mrs Delia',
  'Rebecca Dimech',
  'Mrs Delia',
  'Miss Dimech',
  'Rebecca Dimech',
  'Mrs Delia',
  'Miss Dimech',
  'Miss Dimech',
  'Andre Falzon',
  'Mrs Delia',
  'The Opposition',
  'Miss Dimech',
  'Adrian Delia',
  'Massimo Dutti',
  'Rebecca Dimech',
  'Dr Nickie',
  'Mrs Delia',
  'Rebecca DimechRebecca',
  'The Opposition',
  'Rebecca Dimech'],
 ['Mrs Adrian',
  'Elisabetta Franchi',
  'Bisazza Street',
  'Dizz Group',
  'Mrs Delia',
  'Mrs Delia',
  'HSBC Bank',
  'Dr Delia',
  'The Elisabetta',
  'Bisazza Street',
  'Mrs Delia'],
 ['Nationalist Party', 'New Way'],
 ['Nationalist Party'],
 ['Louise Tedesco',
  'Nationalist Party',
  'The Great',
  'Democratic Uzbekistani',
  'Nationalist Party',
  'Smashing Uzbekistan'],
 ['Nationalist Party',
  'Adrian Delia',
  'Chief Justice',
  'Massimo Dutti',
  'The Point',
  'Maltese Constitution',
  'Massimo Dutti',
  'Massimo Dutti',
  'Massimo Dutti',
  'Nationalist Party',
  'Dr Delia',
  'Nationalist Party',
  'Dr Delia',
  'Nationalist Party',
  'Last Friday',
  'Massimo Dutti',
  'The Point',
  'Chief Justice',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'Adrian Delia',
  'Adrian Delia',
  'Massimo Dutti',
  'Chief Justice',
  'Nationalist Party'],
 ['Labour Party',
  'Nationalist Party',
  'Adrian Delia',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'Frank Portelli',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'Mrs Delia',
  'Malta Today',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'European Parliament',
  'Nationalist Party',
  'Frank Portelli'],
 ['Adrian Delia', 'Chief Justice', 'Sander Borg', 'Chief Justice'],
 ['Independence Day',
  'Mrs Delia',
  'Mrs Muscat',
  'So Excited',
  'At The',
  'First Rains',
  'Wore My',
  'Women In',
  'Public Life'],
 ['Adrian Delia', 'Chief Justice', 'The Malta'],
 ['Prime Minister', 'Sunita Mukhi'],
 ['An Opposition', 'An Opposition'],
 ['Adrian Delia',
  'Keith Schembri',
  'Pasta Rummo',
  'Pasta Rummo',
  'Keith Schembri'],
 ['Great Gay',
  'Political War',
  'Joseph Muscat',
  'Labour Party',
  'Labour Party',
  'Gabi Calleja',
  'Nationalist Party',
  'Nationalist Party',
  'CC Bill',
  'Aequitas Management',
  'Aequitas Legal',
  'Georg Sapiano',
  'Adrian Delia',
  'Gabi Calleja',
  'Joseph Muscat'],
 ['Adrian Delia',
  'Nationalist Party',
  'Joseph Muscat',
  'Keith Schembri',
  'Christian Kalin'],
 ['Last November',
  'Prime Minister',
  'New York',
  'Finance Minister',
  'Central Bank',
  'Malta Financial',
  'Services Authority',
  'Central Bank',
  'Finance Minister',
  'Prime Minister',
  'New York',
  'Joe Bannister'],
 ['Nationalist Party',
  'Adrian Delia',
  'Far Right',
  'Kristy Debono',
  'Mrs Delia',
  'Mrs Delia',
  'Jean Pierre',
  'Kristy Debono'],
 ['Nationalist Party'],
 ['Frank Portelli',
  'Labour Party',
  'Nationalist Party',
  'Edwin Vassallo',
  'Nationalist Minister',
  'European Commissioner',
  'Tonio Borg',
  'GRANDE ADRIAN',
  'Dr Borg',
  'Tonio Borg',
  'Nationalist Party',
  'Dr Borg',
  'Far Right',
  'Former Nationalist',
  'European Commissioner',
  'Tonio Borg',
  'Nationalist Party',
  'Adrian Delia'],
 ['Anton Rea', 'Planning Authority'],
 ['Anton Rea',
  'Keith Seychell',
  'Environment Minister',
  'VAT Department',
  'Inland Revenue',
  'Planning Authority',
  'Mr Cutajar',
  'Prime Minister',
  'Prime Minister',
  'Planning Authority',
  'Prime Minister',
  'Anton Rea',
  'Mr Cutajar'],
 ['Anton Rea',
  'Planning Authority',
  'Mr Cutajar',
  'Prime Minister',
  'Mrs Muscat',
  'Labour Party',
  'Environment Authority',
  'Planning Authority',
  'Mr Cutajar',
  'Mr Cutajar',
  'Anton Rea',
  'Mr Cutajar',
  'Prime Minister'],
 ['Anton Rea',
  'Environment Minister',
  'Jose Herrera',
  'Capo Crudo',
  'Keith Seychell',
  'Silvio Parnis',
  'Global Capital',
  'Chris Pace',
  'Ivan Portelli',
  'VAT Department',
  'Inland Revenue',
  'Planning Authority',
  'Mr Cutajar',
  'The Planning',
  'Mr Cutajar',
  'Labour Party',
  'Fifth District',
  'Prime Minister',
  'Prime Minister',
  'Kurt Farrugia',
  'Economy Minister',
  'Labour Party',
  'Chris Cardona',
  'Mr Cutajar',
  'Prime Minister',
  'Prime Minister',
  'Mr Cutajar',
  'The Prime',
  'Planning Authority',
  'Gulf States',
  'Mr Cutajar',
  'At Capo',
  'Keith Seychell',
  'Environment Minister',
  'Mrs Jose',
  'Keith Seychell',
  'Parliamentary Secretary',
  'Silvio Parnis',
  'Chris Pace',
  'Anton Rea',
  'Ivan Portelli',
  'VAT Department',
  'Inland Revenue',
  'Samantha Portelli',
  'Anton Rea',
  'Labour Party',
  'Anton Rea',
  'Labour Party',
  'Anton Rea',
  'Kurt Farrugia',
  'With Mrs',
  'Prime Minister',
  'Anton Rea'],
 ['The British',
  'Foreign Secretary',
  'Prime Minister',
  'Foreign Secretary',
  'Iron Curtain'],
 ['Peter Micallef'],
 ['Leonardo Fasoli',
  'Maddalena Ravagli',
  'Maze Pictures',
  'The UK',
  'Walter Presents',
  'Walter Iuzzolino',
  'Walter Presents',
  'The Godfather',
  'Mad Men',
  'Kim Rossi'],
 ['Jean Pierre', 'Nationalist Party', 'Nationalist Party'],
 ['World Economic'],
 ['John Bundy',
  'Public Broadcasting',
  'Services Ltd',
  'Burmarrad Commercials',
  'Public Broadcasting',
  'Services Ltd',
  'In John',
  'Public Broadcasting',
  'John Bundy',
  'John Bundy',
  'Public Broadcasting',
  'Services Ltd',
  'Prime Minister',
  'Joseph Muscat',
  'Labour Party'],
 ['The Malta',
  'Nationalist Party',
  'Nationalist Party',
  'Adrian Delia',
  'Panama Papers',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party',
  'Adrian Delia'],
 ['Nationalist Party', 'Kristy Debono', 'Barbie Does'],
 ['Nationalist Party',
  'Nationalist Party',
  'Wales Road',
  'Manwel Dimech',
  'Café Giorgio',
  'Nationalist Party',
  'The Nationalist',
  'Karol Aquilina',
  'Marlene Farrugia',
  'Dr Farrugia',
  'Nationalist Party',
  'Familiar Sliema',
  'Ivan Bartolo',
  'Jean Pierre',
  'Kristy Debono',
  'Robert Arrigo'],
 ['Rebecca Dimech',
  'Miss Dimech',
  'Andre Falzon',
  'Adrian Delia',
  'Nationalist Party',
  'Rebecca Dimech',
  'Andre Falzon',
  'Rebecca Dimech',
  'Nationalist Party',
  'Adrian Delia',
  'Andre Falzon',
  'Nationalist Party',
  'The PN'],
 ['Nationalist Party', 'That Maltese', 'Frank Portelli'],
 ['Clyde Puli',
  'Nationalist Party',
  'Nationalist Party',
  'Something Bad',
  'Medical Services',
  'Nationalist Party',
  'Labour Party',
  'Clyde Puli',
  'Adrian Delia',
  'New Way',
  'Medical Services',
  'Adrian Delia',
  'Clyde Puli',
  'Santa Marija',
  'Clyde Puli',
  'Did On',
  'My Summer',
  'Parliamentary Secretary',
  'Olympic Games',
  'Olympic Games',
  'Parliamentary Secretary'],
 ['Boris Johnson', 'Rudyard Kipling', 'Foreign Secretary'],
 ['Nationalist Party'],
 ['Nationalist Party',
  'Adrian Delia',
  'Miss Rebecca',
  'Miss Melanie',
  'Rebecca Dimech',
  'Mrs Delia'],
 ['Joseph Muscat', 'Adrian Delia', 'Local Councils', 'Nationalist Party'],
 ['Local Councils',
  'The Times',
  'The Malta',
  'Government Gazette',
  'Nationalist Party',
  'Labour Party',
  'Adrian Delia',
  'Joseph Muscat'],
 ['Melanie Gregory',
  'Rebecca Dimech',
  'Labour Party',
  'Adrian Delia',
  'Simple Pleasures',
  'Mrs Delia',
  'Miss Dimech',
  'Mrs Delia',
  'Mrs Delia',
  'Melanie Gregory',
  'Rebecca Dimech',
  'Miss Dimech'],
 ['Victor Calleja', 'Marlene Farrugia'],
 ['Hubert Zammit', 'Nationalist Party'],
 ['Adrian Delia', 'Saviour Balzan'],
 ['Joseph Muscat', 'Labour Party', 'And Delia', 'Air Malta'],
 ['Independence Day', 'Remembrance Sunday', 'Victoria Beckham', 'Mrs Beckham'],
 ['Nationalist Party',
  'Adrian Delia',
  'Paid By',
  'The State',
  'But Never',
  'Goes In',
  'To Work',
  'Paid By',
  'The State',
  'But Never',
  'Goes In',
  'To Work',
  'Clyde Puli',
  'Prime Minister',
  'David Agius',
  'Nationalist Party',
  'Nationalist Party'],
 ['Adrian Delia', 'Nationalist Party', 'The Nationalist'],
 ['Nationalist Party',
  'Labour Party',
  'Prime Minister',
  'The Prime',
  'Labour Party',
  'Nationalist Party'],
 ['Private Eye', 'Prime Minister'],
 ['When Therese',
  'Comodini Cachia',
  'European Parliament',
  'Nationalist Party',
  'Now Jean',
  'Pierre Debono',
  'Comodini Cachia',
  'European Parliament',
  'Jean Pierre',
  'Nationalist Party',
  'The Delia',
  'Nationalist Party',
  'Will Adrian',
  'Rebecca Dimech',
  'Surely NET',
  'Super One',
  'Mario Frendo'],
 ['Nationalist Party', 'Adrian Delia'],
 ['Prime Minister', 'Adrian Delia'],
 ['Angela Merkel', 'Nationalist Party'],
 ['Adrian Delia',
  'Jean Pierre',
  'The PN',
  'Nationalist Party',
  'Adrian Delia',
  'Jean Pierre',
  'Nationalist Party'],
 ['Adrian Delia', 'Jean Pierre'],
 ['Adrian Delia', 'The Sunday', 'Keith Schembri'],
 ['Matthew Xuereb', 'Adrian Delia'],
 ['Nationalist Party',
  'Jean Pierre',
  'Malta Today',
  'Jean Pierre',
  'Jean Pierre',
  'When Delia',
  'The Malta',
  'Jean Pierre',
  'Adrian Delia',
  'Jean Pierre'],
 ['Kevin Cassar',
  'Malta Medical',
  'Nationalist Party',
  'Jean Pierre',
  'Professor Cassar',
  'Eddie Fenech',
  'Adrian Delia',
  'Independence Day',
  'Dr Delia',
  'Dr Fenech',
  'Professor Cassar',
  'Beppe Fenech',
  'Nationalist Party',
  'Nationalist Party',
  'Adrian Delia',
  'Dr Delia',
  'Dr Delia',
  'Ethics Committee',
  'Inland Revenue',
  'Birkirkara Football',
  'Dr Delia',
  'Nationalist Party',
  'Nationalist Party',
  'Nationalist Party'],
 ['Jean Pierre', 'The Times', 'Adrian Delia'],
 ['Mrs Muscat', 'Mrs Delia'],
 ['From Adrian',
  'Nationalist Party',
  'Labour Party',
  'Trade Fair',
  'Pierre Portelli'],
 ['The Nationalist',
  'Adrian Delia',
  'Nationalist Party',
  'Prime Minister',
  'Previous Nationalist',
  'Prime Minister'],
 ['Jean Pierre',
  'Nationalist Party',
  'Nationalist Party',
  'Rosette Thake',
  'Simon Busuttil',
  'Jean Pierre',
  'Adrian Delia',
  'Georg Sapiano'],
 ['Adrian Delia',
  'Prime Minister',
  'Mrs Adrian',
  'Armistice Day',
  'Brigata Laburista'],
 ['Alexander Borg',
  'New York',
  'Prime Minister',
  'Borg Olivier',
  'Nationalist Party',
  'Eddie Fenech',
  'Guido Demarco',
  'Fenech Adami',
  'Lawrence Gonzi',
  'Simon Busuttil',
  'Adrian Delia',
  'Borg Olivier',
  'Borg Olivier',
  'Nationalist Party',
  'Eddie Fenech',
  'Lawrence Gonzi',
  'Nationalist Party',
  'George Borg',
  'Dom Mintoff',
  'Alexander Borg',
  'Prime Minister',
  'Borg Olivier',
  'Dawn Adams',
  'Mrs Borg',
  'Roman Catholic',
  'Mrs Borg',
  'Nationalist Party',
  'Adrian Delia'],
 ['Nationalist Party'],
 ['Ivan Bartolo',
  'Nationalist MP',
  'Adrian Delia',
  'Great Leader',
  'Mr Bartolo',
  'Joseph Muscat',
  'European Parliament'],
 ['Adrian Delia', 'Joseph Muscat'],
 ['Independence Day', 'Nationalist Party', 'Michael Corleone'],
 ['If Beluga']]

Flatten the list.


In [60]:
flat_list = [item for sublist in lst for item in sublist]
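
The nested comprehension reads left to right like two nested for loops: for each sublist, for each item in it, keep the item. The standard library offers an equivalent with itertools.chain.from_iterable. A sketch on a made-up nested list:

```python
from itertools import chain

# Hypothetical nested list in the same shape as lst above.
nested = [['Sebastian Kurz', 'Freedom Party'], [], ['Nationalist Party']]

# Same comprehension as in the notebook.
flat_comprehension = [item for sublist in nested for item in sublist]

# Standard-library alternative that chains the sublists together.
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)
print(flat_chain)
```

Empty sublists contribute nothing in either version, so flattening works whether or not they were removed first.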

Now let's count.


In [66]:
pd.DataFrame(flat_list)[0].value_counts().head(10)


Out[66]:
Nationalist Party    100
Adrian Delia          45
Prime Minister        25
Labour Party          19
Mrs Delia             17
Jean Pierre           16
Dr Delia              13
Mr Cutajar            11
Anton Rea             11
Rebecca Dimech        11
Name: 0, dtype: int64
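
The same kind of count can be done without pandas using collections.Counter from the standard library. The list below is made up for illustration; with the real data you would pass in flat_list.

```python
from collections import Counter

# Hypothetical flat list of matches.
names = [
    'Nationalist Party', 'Adrian Delia', 'Nationalist Party',
    'Dr Delia', 'Adrian Delia', 'Nationalist Party',
]

counts = Counter(names)

# most_common(n) returns the n most frequent items with their counts.
top_two = counts.most_common(2)
print(top_two)
```

Either way, the count is per exact string, so 'Adrian Delia' and 'Dr Delia' are tallied separately.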

In [68]:
df_names = pd.DataFrame(flat_list)

In [69]:
df_names.to_csv('names.csv')
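
As written, the CSV gets a header of 0 and an extra column holding the row index. If that is not wanted, one option is to name the column and pass index=False. The sketch below writes to an in-memory buffer instead of a file so the result can be printed; the column name `Name` and the two-item list are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical short list standing in for flat_list.
flat = ['Nationalist Party', 'Adrian Delia']

# Give the column a readable name instead of the default 0.
names_frame = pd.DataFrame({'Name': flat})

# index=False leaves the row index out of the CSV.
buffer = io.StringIO()
names_frame.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```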
